The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about common practice or about the bottlenecks the community faces in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, and algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants (32%) stated that they did not have enough time for method development, and 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% performed ensembling, based either on multiple identical models (61%) or on heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
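The patch-based handling of oversized samples mentioned above can be sketched roughly as follows. This is a minimal illustration, assuming a regular grid of patches with a fixed stride; the survey does not say how participants sampled patches, and the function names `patch_starts` and `patch_origins` are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple


def patch_starts(size: int, patch: int, stride: int) -> List[int]:
    """Start offsets of patches along one axis, covering the full extent."""
    if patch <= 0 or stride <= 0:
        raise ValueError("patch and stride must be positive")
    if patch > size:
        raise ValueError("patch is larger than the axis it should tile")
    starts = list(range(0, size - patch + 1, stride))
    # Add a final patch flush with the border if the stride left a gap.
    if starts[-1] != size - patch:
        starts.append(size - patch)
    return starts


def patch_origins(
    shape: Sequence[int], patch_shape: Sequence[int], stride: Sequence[int]
) -> List[Tuple[int, ...]]:
    """Origins of all patches for an N-dimensional sample (2D or 3D alike)."""
    if not (len(shape) == len(patch_shape) == len(stride)):
        raise ValueError("shape, patch_shape and stride must have equal length")
    axes = [patch_starts(s, p, t) for s, p, t in zip(shape, patch_shape, stride)]
    return list(product(*axes))


if __name__ == "__main__":
    # A 10x10 image tiled with 4x4 patches and stride 4 (assumed toy sizes).
    for origin in patch_origins((10, 10), (4, 4), (4, 4)):
        print(origin)
```

A grid like this is typical for tiling at inference time; during training, patches are often drawn at random positions instead, which this sketch does not cover.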
Recently, the success of pre-training in the text domain has been fully extended to vision, audio, and cross-modal scenarios. The pre-training models proposed for different modalities show a rising trend of homogeneity in their model structures, which creates the opportunity to implement different pre-training models within a uniform framework. In this paper, we present TencentPretrain, a toolkit supporting pre-training models of different modalities. The core feature of TencentPretrain is its modular design. The toolkit uniformly divides pre-training models into 5 components: embedding, encoder, target embedding, decoder, and target. As almost all common modules are provided in each component, users can choose the desired modules from different components to build a complete pre-training model. The modular design enables users to efficiently reproduce existing pre-training models or build brand-new ones. We test the toolkit on text, vision, and audio benchmarks and show that it can match the performance of the original implementations.
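The idea of picking one module per component can be illustrated with a toy registry. This is only a sketch of the modular principle, not TencentPretrain's actual API: the registry, the module names, and `build_model` are hypothetical, and chaining the five components linearly is a deliberate simplification (in a real encoder-decoder model the target embedding consumes target-side input).

```python
from typing import Callable, Dict, List

Vector = List[float]
Module = Callable[[Vector], Vector]

# The five components named in the abstract, in a fixed order.
COMPONENTS = ("embedding", "encoder", "target_embedding", "decoder", "target")


def running_sum(xs: Vector) -> Vector:
    """Toy 'encoder': cumulative sum over the sequence."""
    total, out = 0.0, []
    for x in xs:
        total += x
        out.append(total)
    return out


# Hypothetical registry: each component offers interchangeable toy modules.
REGISTRY: Dict[str, Dict[str, Module]] = {
    "embedding": {"scale": lambda xs: [2.0 * x for x in xs]},
    "encoder": {"running_sum": running_sum, "identity": lambda xs: list(xs)},
    "target_embedding": {"identity": lambda xs: list(xs)},
    "decoder": {"reverse": lambda xs: xs[::-1], "identity": lambda xs: list(xs)},
    "target": {"mean": lambda xs: [sum(xs) / len(xs)]},
}


def build_model(config: Dict[str, str]) -> Module:
    """Assemble a model by choosing one registered module per component."""
    stages = []
    for component in COMPONENTS:
        name = config.get(component)
        if name not in REGISTRY[component]:
            raise ValueError(f"no module {name!r} registered for {component!r}")
        stages.append(REGISTRY[component][name])

    def model(xs: Vector) -> Vector:
        # Simplification: run the chosen modules one after another.
        for stage in stages:
            xs = stage(xs)
        return xs

    return model


if __name__ == "__main__":
    model = build_model({
        "embedding": "scale",
        "encoder": "running_sum",
        "target_embedding": "identity",
        "decoder": "reverse",
        "target": "mean",
    })
    print(model([1.0, 2.0, 3.0]))
```

Swapping a single entry in the configuration, for example `"encoder": "identity"`, yields a different model without touching the other components, which is the property the abstract attributes to the modular design.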
Recently, the dominant DETR-based approaches apply a central-concept spatial prior to accelerate the convergence of Transformer detectors. These methods gradually refine the reference points toward the center of target objects and imbue object queries with the updated central reference information for spatially conditional attention. However, centralizing reference points may severely deteriorate queries' saliency and confuse detectors due to the indiscriminative spatial prior. To bridge the gap between the reference points of salient queries and Transformer detectors, we propose SAlient Point-based DETR (SAP-DETR), which treats object detection as a transformation from salient points to instance objects. In SAP-DETR, we explicitly initialize a query-specific reference point for each object query, gradually aggregate them into an instance object, and then predict the distance from each side of the bounding box to these points. By rapidly attending to the query-specific reference region and other conditional extreme regions of the image features, SAP-DETR can effectively bridge the gap between the salient point and the query-based Transformer detector with a significantly faster convergence speed. Our extensive experiments demonstrate that SAP-DETR converges 1.4 times faster with competitive performance. Under the standard training scheme, SAP-DETR consistently improves on SOTA approaches by 1.0 AP. Based on ResNet-DC-101, SAP-DETR achieves 46.9 AP.
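The box parameterization described above, a reference point plus the distance to each side of the bounding box, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `(left, top, right, bottom)` ordering, the corner-format box, and the function names are assumptions.

```python
from typing import Tuple

Point = Tuple[float, float]                   # (x, y)
Distances = Tuple[float, float, float, float]  # (left, top, right, bottom), assumed order
Box = Tuple[float, float, float, float]        # (x_min, y_min, x_max, y_max)


def decode_box(point: Point, distances: Distances) -> Box:
    """Recover a box from a reference point and its distance to each side."""
    if min(distances) < 0:
        raise ValueError("side distances must be non-negative")
    px, py = point
    left, top, right, bottom = distances
    return (px - left, py - top, px + right, py + bottom)


def encode_distances(point: Point, box: Box) -> Distances:
    """Inverse mapping: side distances of a box as seen from an inside point."""
    px, py = point
    x_min, y_min, x_max, y_max = box
    if not (x_min <= px <= x_max and y_min <= py <= y_max):
        raise ValueError("reference point must lie inside the box")
    return (px - x_min, py - y_min, x_max - px, y_max - py)


if __name__ == "__main__":
    # An off-center reference point still describes the same box.
    box = (0.25, 0.375, 0.75, 0.625)
    point = (0.5, 0.5)
    print(encode_distances(point, box))
    print(decode_box(point, encode_distances(point, box)))
```

Unlike a center-plus-size parameterization, this form lets any point inside the object serve as the reference, which is what allows different queries to keep distinct, non-centralized reference points.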
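In narrow spaces, motion planning based on the traditional hierarchical autonomous system can lead to collisions caused by mapping, localization, and control noise. Moreover, it cannot operate when no map is available. To address these problems, we leverage deep reinforcement learning, which has been shown to be effective for autonomous decision-making, to self-explore in narrow spaces without a map while avoiding collisions. Specifically, based on our Ackermann-steering rectangular ZebraT robot and its Gazebo simulator, we propose a rectangular safety region to represent states and detect collisions for rectangular-shaped robots, together with a carefully crafted reward function for reinforcement learning that does not require destination information. We then benchmark five reinforcement learning algorithms, namely DDPG, DQN, SAC, PPO, and PPO-discrete, in a simulated narrow track. After training, the well-performing DDPG and DQN models can be transferred to three brand-new simulated tracks and then to three real-world tracks.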
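Thanks to the rapid development of robotic technologies, robot mowing is emerging to free humans from tedious and time-consuming landscaping work. Traditionally, robot mowing is treated as a "Coverage Path Planning" problem, with a simplification that converts non-convex obstacles into convex ones. In addition, the converted obstacles are commonly dilated by the robot's circumscribed circle for collision avoidance. However, obstacles on a lawn are usually non-convex (imagine a garden on the lawn), so these obstacle-processing methods fill in some concave areas that the robot can then no longer reach. This produces unavoidable uncut areas along the lawn edge, which diminishes the elegance of the landscape and provokes rework. To shrink the uncut area around the lawn edge, we reframe the problem as a brand-new problem, called the "Edge Coverage Path Planning" problem, which is dedicated to path planning with the objective of covering the edge. Correspondingly, we propose two planning methods, the "big and small disk" method and the "sliding chopstick" method, which tackle the problem by leveraging image morphological processing and computational geometry. In validation, our proposed methods outperform the traditional "dilation-by-circumcircle" method.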
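Learning powerful representations in bird's-eye view (BEV) for perception tasks is trending and is drawing extensive attention from both industry and academia. Conventional approaches for most autonomous driving algorithms perform detection, segmentation, tracking, etc., in a front or perspective view. As sensor configurations become more complex, integrating multi-source information from different sensors and representing features in a unified view becomes vitally important. BEV perception has several advantages: representing the surrounding scene in BEV is intuitive and fusion-friendly, and representing objects in BEV is the most desirable form for subsequent modules such as planning and/or control. The core problems of BEV perception lie in (a) how to reconstruct the lost 3D information via view transformation from perspective view to BEV; (b) how to acquire ground-truth annotations in the BEV grid; (c) how to formulate the pipeline to incorporate features from different sources and views; and (d) how to adapt and generalize algorithms as sensor configurations vary across scenarios. In this survey, we review the most recent work on BEV perception and provide an in-depth analysis of different solutions. Moreover, several systematic designs of BEV approaches from industry are described. Furthermore, we introduce a full suite of practical guidelines for improving the performance of BEV perception tasks, covering camera, LiDAR, and fusion inputs. Finally, we point out future research directions in this area. We hope this report will shed some light for the community and encourage more research on BEV perception. We keep an active repository to collect the most recent work and provide a toolbox with a bag of tricks at https://github.com/openperceptionx/bevperception-survey-recipe.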
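Ophthalmic images and their derivatives, such as the retinal nerve fiber layer (RNFL) thickness map, are crucial for detecting and monitoring ophthalmic diseases (e.g., glaucoma). For computer-aided diagnosis of eye diseases, the key technique is to automatically extract meaningful features from ophthalmic images that can reveal biomarkers (e.g., RNFL thinning patterns) associated with functional vision loss. However, learning representations from ophthalmic images that link structural retinal damage with human vision loss is challenging, mostly due to large anatomical variations between patients. The task becomes even more challenging in the presence of image artifacts, which are common due to issues with image acquisition and automated segmentation. In this paper, we propose an artifact-tolerant unsupervised learning framework, termed EyeLearn, for learning representations of ophthalmic images. EyeLearn has an artifact correction module that learns representations that can best predict artifact-free ophthalmic images. In addition, EyeLearn adopts a clustering-guided contrastive learning strategy to explicitly capture intra- and inter-image affinities. During training, images are dynamically organized into clusters to form contrastive samples, where images in the same cluster are encouraged to learn similar representations and images in different clusters dissimilar ones. To evaluate EyeLearn, we use the learned representations for visual field prediction and glaucoma detection on a real-world ophthalmic image dataset of glaucoma patients. Extensive experiments and comparisons with state-of-the-art methods verify the effectiveness of EyeLearn in learning optimal feature representations from ophthalmic images.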
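Single image desnowing is a common yet challenging task. Complex snow degradations and diverse degradation scales demand strong representation ability. To enable the desnowing network to perceive diverse snow degradations and to model the context interaction between local details and global information, we propose a powerful architecture called SnowFormer. First, it performs scale-aware feature aggregation in the encoder to capture rich snow information across various degradations. Second, to handle large-scale degradation, it uses a novel context interaction transformer block in the decoder, which performs global context interaction between the local details and global information produced by the preceding scale-aware feature aggregation; the additional local context interaction improves the recovery of scene details. Third, we design a heterogeneous feature projection head that progressively fuses features from both the encoder and the decoder and projects the refined features into the clean image. Extensive experiments show that the proposed SnowFormer achieves significant improvements over other SOTA methods. Compared with the SOTA single image desnowing method HDCW-Net, it improves PSNR by 9.2 dB on the CSD test set. Moreover, it achieves a 5.13 dB increase in PSNR compared with the general image restoration architecture NAFNet, which verifies the strong representation ability of SnowFormer for the desnowing task. The code is released at https://github.com/ephemeral182/snowformer.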
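To put the reported PSNR gains in perspective, the standard PSNR definition can be sketched as below. This uses the textbook formula, 10·log10(MAX² / MSE), on flattened pixel lists; it is not the evaluation code of the paper, and the helper `mse_ratio_for_gain` is a hypothetical name introduced for illustration.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], restored: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are given as flat sequences of pixel intensities in [0, max_value].
    """
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks for a given PSNR gain on a single image."""
    return 10.0 ** (gain_db / 10.0)


if __name__ == "__main__":
    clean = [0.0, 1.0]      # toy two-pixel image (assumed values)
    restored = [0.1, 0.9]
    print(round(psnr(clean, restored), 2))
    print(round(mse_ratio_for_gain(9.2), 2))
    print(round(mse_ratio_for_gain(5.13), 2))
```

Because PSNR is logarithmic, a gain of 9.2 dB on one image corresponds to a mean squared error roughly 8.3 times smaller; for PSNR averaged over a test set this conversion holds only approximately.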
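Understanding human emotions is a crucial ability for intelligent robots to provide better human-robot interaction. Existing works are limited to trimmed video-level emotion classification and fail to locate the temporal window corresponding to an emotion. In this paper, we introduce a new task, named Temporal Emotion Localization in videos (TEL), which aims to detect human emotions and localize their corresponding temporal boundaries in untrimmed videos with aligned subtitles. Compared with temporal action localization, TEL presents three unique challenges: 1) the temporal dynamics of emotions are extremely diverse; 2) the emotion cues are embedded in both appearances and complex plots; 3) fine-grained temporal annotations are complicated and labor-intensive to produce. To address the first two challenges, we propose a novel dilated context integrated network with a coarse-fine two-stream architecture. The coarse stream captures varied temporal dynamics by modeling multi-granularity temporal contexts. The fine stream achieves complex plot understanding by reasoning about the dependencies between the multi-granularity temporal contexts from the coarse stream and adaptively integrating them into fine-grained video segment features. To address the third challenge, we introduce a cross-modal consensus learning paradigm, which leverages the inherent semantic consensus between the aligned video and subtitles to achieve weakly supervised learning. We contribute a new test set with 3,000 manually annotated temporal boundaries so that future research on the TEL problem can be quantitatively evaluated. Extensive experiments show the effectiveness of our approach to temporal emotion localization. The repository of this work is at https://github.com/yyjmjc/temporal-emotion-localization-in-videos.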
Owing to the large semantic gap between natural and formal languages, neural semantic parsing is typically bottlenecked by the complexity of handling both input semantics and output syntax. Recent works have proposed several forms of supplementary supervision, but none generalizes across multiple formal languages. This paper proposes a unified intermediate representation (IR) for graph query languages, named GraphQ IR. It has a natural-language-like expression that bridges the semantic gap and a formally defined syntax that maintains the graph structure. A neural semantic parser can therefore convert user queries into GraphQ IR more precisely, and the IR can later be losslessly compiled into various downstream graph query languages. Extensive experiments on several benchmarks, including KQA Pro, Overnight, GrailQA, and MetaQA-Cypher, under standard i.i.d., out-of-distribution, and low-resource settings validate GraphQ IR's superiority over the previous state of the art, with an accuracy improvement of up to 11%.